Blending and Bagging


Roadmap
(1) Embedding Numerous Features: Kernel Models

Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem;
support vector regression (sparse) via regularized tube error + Lagrange dual

(2) Combining Predictive Features: Aggregation Models

Lecture 7: Blending and Bagging
• Motivation of Aggregation
• Uniform Blending
• Linear and Any Blending
• Bagging (Bootstrap Aggregation)

(3) Distilling Implicit Features: Extraction Models
Blending and Bagging Motivation of Aggregation


An Aggregation Story
Your T friends $g_1, \ldots, g_T$ predict whether a stock will go up as $g_t(x)$.

• select the most trustworthy friend from their usual performance: validation!
• mix the predictions from all your friends uniformly: let them vote!
• mix the predictions from all your friends non-uniformly: let them vote, but give some of them more ballots
• combine the predictions conditionally: if [t satisfies some condition], give some ballots to friend t

aggregation models: mix or combine
hypotheses (for better performance)
Aggregation with Math Notations

Your T friends $g_1, \ldots, g_T$ predict whether a stock will go up as $g_t(x)$.

• select the most trustworthy friend from their usual performance
  $G(x) = g_{t_*}(x)$ with $t_* = \operatorname{argmin}_{t \in \{1, 2, \ldots, T\}} E_{\text{val}}(g_t^-)$
• mix the predictions from all your friends uniformly
  $G(x) = \operatorname{sign}\left(\sum_{t=1}^{T} 1 \cdot g_t(x)\right)$
• mix the predictions from all your friends non-uniformly
  $G(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot g_t(x)\right)$ with $\alpha_t \ge 0$
  • include select: $\alpha_t = [\![E_{\text{val}}(g_t^-) \text{ smallest}]\!]$
  • include uniformly: $\alpha_t = 1$
• combine the predictions conditionally
  $G(x) = \operatorname{sign}\left(\sum_{t=1}^{T} q_t(x) \cdot g_t(x)\right)$ with $q_t(x) \ge 0$
  • include non-uniformly: $q_t(x) = \alpha_t$

aggregation models: a rich family


Recall: Selection by Validation 


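$G(x) = g_{t_*}(x)$ with $t_* = \operatorname{argmin}_{t \in \{1, 2, \ldots, T\}} E_{\text{val}}(g_t^-)$

• simple and popular
• what if we use $E_{\text{in}}(g_t)$ instead of $E_{\text{val}}(g_t^-)$?
  complexity price on $d_{\text{VC}}$, remember? :-)
• need one strong $g_t^-$ to guarantee small $E_{\text{val}}$ (and small $E_{\text{out}}$)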


selection: 

rely on one strong hypothesis 
aggregation: 

can we do better with many 
(possibly weaker) hypotheses? 
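
Selection by validation can be sketched in a few lines. This is only an illustration under assumptions: the 0/1 error, the helper names (`validation_error`, `select_by_validation`, `make_stump`) and the toy validation set are all made up here, and indices are 0-based.

```python
def validation_error(g, val_data):
    """Fraction of validation examples (x, y) that g misclassifies (0/1 error)."""
    return sum(1 for x, y in val_data if g(x) != y) / len(val_data)


def select_by_validation(hypotheses, val_data):
    """Return (t_star, g_{t_star}) with the smallest validation error."""
    errors = [validation_error(g, val_data) for g in hypotheses]
    t_star = min(range(len(hypotheses)), key=lambda t: errors[t])
    return t_star, hypotheses[t_star]


def make_stump(theta):
    """Hypothetical candidate: predict +1 when x >= theta, else -1."""
    return lambda x: 1 if x >= theta else -1


# Toy validation set labelled by the sign of x (made up for illustration).
val_data = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]
candidates = [make_stump(-1.0), make_stump(0.0), make_stump(1.0)]

t_star, G = select_by_validation(candidates, val_data)
print(t_star, validation_error(G, val_data))
```

The returned G is exactly one of the candidates, which is why selection needs at least one strong hypothesis among them.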







Why Might Aggregation Work? 
































• mix different weak hypotheses uniformly: G(x) 'strong'
  aggregation ⟹ feature transform (?)
• mix different random-PLA hypotheses uniformly: G(x) 'moderate'
  aggregation ⟹ regularization (?)

proper aggregation ⟹ better performance





Fun Time

Consider three decision stump hypotheses from $\mathbb{R}$ to $\{-1, +1\}$:
$g_1(x) = \operatorname{sign}(1 - x)$, $g_2(x) = \operatorname{sign}(1 + x)$, $g_3(x) = -1$. When mixing
the three hypotheses uniformly, what is the resulting $G(x)$?

(1) $2[\![|x| < 1]\!] - 1$
(2) $2[\![|x| > 1]\!] - 1$
(3) $2[\![x < -1]\!] - 1$
(4) $2[\![x > +1]\!] - 1$

Reference Answer: (1)

The 'region' that gets two positive votes
from $g_1$ and $g_2$ is $|x| < 1$, and thus $G(x)$ is
positive within the region only. We see that the
three decision stumps $g_t$ can be aggregated to
form a more sophisticated hypothesis G.
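
The reference answer can be checked numerically. The sketch below is illustrative only: the tie-breaking rule sign(0) = +1 is an assumption, so the boundary points x = -1 and x = +1 are left out of the check.

```python
def sign(v):
    """Sign returning +1 or -1; sign(0) = +1 is an assumed tie-break."""
    return 1 if v >= 0 else -1


# The three decision stumps from the question.
def g1(x):
    return sign(1 - x)


def g2(x):
    return sign(1 + x)


def g3(x):
    return -1


def uniform_blend(hypotheses, x):
    """Uniform blending for classification: one ballot per hypothesis."""
    return sign(sum(g(x) for g in hypotheses))


def option_1(x):
    """Option (1): 2 * [[|x| < 1]] - 1."""
    return 2 * int(abs(x) < 1) - 1


# Test points away from the boundaries x = -1 and x = +1.
xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
votes = [uniform_blend([g1, g2, g3], x) for x in xs]
print(votes == [option_1(x) for x in xs])
```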
Blending and Bagging Uniform Blending


Uniform Blending (Voting) for Classification
uniform blending: known $g_t$, each with 1 ballot

$G(x) = \operatorname{sign}\left(\sum_{t=1}^{T} 1 \cdot g_t(x)\right)$

• same $g_t$ (autocracy):
  as good as one single $g_t$
• very different $g_t$ (diversity + democracy):
  majority can correct minority
• similar results with uniform voting for multiclass

$G(x) = \operatorname{argmax}_{1 \le k \le K} \sum_{t=1}^{T} [\![g_t(x) = k]\!]$
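
A minimal sketch of uniform voting for multiclass, assuming three hypothetical classifiers with labels in {1, 2, 3}. The function name `vote` and the classifiers are made up for illustration. Ties are resolved by whichever label was seen first, which is one arbitrary choice among many.

```python
from collections import Counter


def vote(hypotheses, x):
    """Uniform blending for multiclass: the class collecting the most ballots."""
    ballots = Counter(g(x) for g in hypotheses)
    # most_common(1) returns [(label, count)] for the top label.
    return ballots.most_common(1)[0][0]


# Three hypothetical multiclass classifiers (illustrative only).
def g_a(x):
    return 1 if x < 0 else 2


def g_b(x):
    return 2


def g_c(x):
    return 3 if x > 5 else 1


classifiers = [g_a, g_b, g_c]
predictions = [vote(classifiers, x) for x in (-1.0, 3.0, 10.0)]
print(predictions)
```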

















how about regression?
Uniform Blending for Regression

$G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)$

• same $g_t$ (autocracy):
  as good as one single $g_t$
• very different $g_t$ (diversity + democracy):
  some $g_t(x) > f(x)$, some $g_t(x) < f(x)$
  ⟹ average could be more accurate than individual





diverse hypotheses: 


even simple uniform blending 
can be better than any single hypothesis 





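Theoretical Analysis of Uniform Blending

$G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)$

For a fixed $x$, writing $g_t$, $G$, $f$ for $g_t(x)$, $G(x)$, $f(x)$:

$$
\begin{aligned}
\operatorname{avg}\left((g_t - f)^2\right)
&= \operatorname{avg}\left(g_t^2 - 2 g_t f + f^2\right) \\
&= \operatorname{avg}\left(g_t^2\right) - 2 G f + f^2 \\
&= \operatorname{avg}\left(g_t^2\right) - G^2 + (G - f)^2 \\
&= \operatorname{avg}\left(g_t^2\right) - 2 G^2 + G^2 + (G - f)^2 \\
&= \operatorname{avg}\left(g_t^2 - 2 g_t G + G^2\right) + (G - f)^2 \\
&= \operatorname{avg}\left((g_t - G)^2\right) + (G - f)^2
\end{aligned}
$$

$$
\operatorname{avg}\left(E_{\text{out}}(g_t)\right)
= \operatorname{avg}\left(\mathcal{E}(g_t - G)^2\right) + E_{\text{out}}(G)
\ge E_{\text{out}}(G)
$$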
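
The pointwise identity behind this result, avg((g_t - f)^2) = avg((g_t - G)^2) + (G - f)^2, can be verified numerically at a single x. The target value and the 50 noisy predictions below are made up for illustration.

```python
import random

random.seed(0)

f = 1.7  # target value f(x) at one fixed x (made up)
# Hypothetical predictions g_t(x): biased, noisy guesses around f.
g = [f + random.gauss(0.3, 1.0) for _ in range(50)]
T = len(g)

G = sum(g) / T  # uniform blend at x

avg_sq_err = sum((g_t - f) ** 2 for g_t in g) / T  # avg (g_t - f)^2
avg_dev = sum((g_t - G) ** 2 for g_t in g) / T     # avg (g_t - G)^2
blend_err = (G - f) ** 2                           # (G - f)^2

print(abs(avg_sq_err - (avg_dev + blend_err)) < 1e-9)
print(blend_err <= avg_sq_err)
```

Because the deviation term is a mean of squares, it can never be negative, so the blend's squared error at x never exceeds the average squared error of the individual hypotheses.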





Some Special $g_t$
consider a virtual iterative process that for t = 1, 2, ..., T
(1) request size-N data $D_t$ from $P^N$ (i.i.d.)
(2) obtain $g_t$ by $A(D_t)$

$\bar{g} = \lim_{T \to \infty} G = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g_t = \mathcal{E}_{D}\, A(D)$

$\operatorname{avg}\left(E_{\text{out}}(g_t)\right) = \operatorname{avg}\left(\mathcal{E}(g_t - \bar{g})^2\right) + E_{\text{out}}(\bar{g})$

expected performance of A = expected deviation to consensus
                            + performance of consensus

• performance of consensus: called bias
• expected deviation to consensus: called variance





uniform blending: 
reduces variance for more stable performance 





Fun Time

Consider applying uniform blending $G(x) = \frac{1}{T} \sum_{t=1}^{T} g_t(x)$ on linear
regression hypotheses $g_t(x) = \operatorname{innerprod}(w_t, x)$. Which of the following
properties best describes the resulting $G(x)$?

(1) a constant function of x
(2) a linear function of x
(3) a quadratic function of x
(4) none of the other choices

Reference Answer: (2)

$G(x) = \operatorname{innerprod}\left(\frac{1}{T} \sum_{t=1}^{T} w_t,\; x\right)$

which is clearly a linear function of x. Note that
we write 'innerprod' instead of the usual
'transpose' notation to avoid symbol conflict
with T (number of hypotheses).
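
A small numeric check of this answer: averaging the predictions of linear hypotheses gives the same value as one linear hypothesis with the averaged weight vector. The weight vectors and the input point are made up for illustration.

```python
def innerprod(w, x):
    """Inner product of two equal-length vectors."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical weight vectors w_t and an input point x (illustrative values).
ws = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
x = [2.0, -3.0]
T = len(ws)

# Uniform blend of the T predictions.
blend = sum(innerprod(w, x) for w in ws) / T

# Single linear hypothesis with the averaged weights.
w_bar = [sum(w[i] for w in ws) / T for i in range(len(x))]
single = innerprod(w_bar, x)

print(abs(blend - single) < 1e-12)
```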
Blending and Bagging Linear and Any Blending


Linear Blending
linear blending: known $g_t$, each to be given $\alpha_t$ ballots

$G(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot g_t(x)\right)$ with $\alpha_t \ge 0$

computing 'good' $\alpha_t$: $\min_{\alpha_t \ge 0} E_{\text{in}}(\alpha)$

LinReg + transformation:
$\min_{w_i} \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \sum_{i=1}^{\tilde{d}} w_i \phi_i(x_n)\right)^2$

linear blending for regression:
$\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \sum_{t=1}^{T} \alpha_t g_t(x_n)\right)^2$

like two-level learning, remember? :-)

linear blending = LinModel + hypotheses as transform + constraints





Constraint on $\alpha_t$
linear blending = LinModel + hypotheses as transform + constraints:

$\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^{N} \operatorname{err}\left(y_n, \sum_{t=1}^{T} \alpha_t g_t(x_n)\right)$

linear blending for binary classification

if $\alpha_t < 0$ ⟹ $\alpha_t g_t(x) = |\alpha_t| \left(-g_t(x)\right)$
• negative $\alpha_t$ for $g_t$ ≡ positive $|\alpha_t|$ for $-g_t$
• if you have a stock up/down classifier with 99% error, tell me! :-)

in practice, often
linear blending = LinModel + hypotheses as transform (with the constraints dropped)
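
Unconstrained linear blending for regression can be sketched as ordinary least squares on the transformed data $z_n = (g_1(x_n), \ldots, g_T(x_n))$. Everything below is an assumption made for illustration: the two hypotheses, the toy validation set, and the hand-written solver `solve`, which is used only to keep the example dependency-free. The toy targets are built with one negative weight to show what dropping $\alpha_t \ge 0$ allows.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_blend(hyps, val_data):
    """Least-squares alpha on z_n = (g_1(x_n), ..., g_T(x_n)), no constraints."""
    Z = [[g(x) for g in hyps] for x, _ in val_data]
    y = [y_n for _, y_n in val_data]
    T = len(hyps)
    # Normal equations: (Z^T Z) alpha = Z^T y.
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(T)] for i in range(T)]
    Zty = [sum(z[i] * y_n for z, y_n in zip(Z, y)) for i in range(T)]
    return solve(ZtZ, Zty)


# Two hypothetical hypotheses and a toy validation set whose targets are
# exactly 2 * g1(x) - 0.5 * g2(x), so the true weights are known.
g1 = lambda x: x
g2 = lambda x: x * x
val_data = [(x, 2.0 * g1(x) - 0.5 * g2(x)) for x in (-2.0, -1.0, 0.5, 1.0, 3.0)]

alpha = linear_blend([g1, g2], val_data)
print(alpha)
```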
Linear Blending versus Selection
in practice, often
$g_1 \in \mathcal{H}_1, g_2 \in \mathcal{H}_2, \ldots, g_T \in \mathcal{H}_T$
by minimum $E_{\text{in}}$

• recall: selection by minimum $E_{\text{in}}$
  best of best, paying $d_{\text{VC}}\left(\bigcup_{t=1}^{T} \mathcal{H}_t\right)$
• recall: linear blending includes selection as special case
  by setting $\alpha_t = [\![E_{\text{val}}(g_t^-) \text{ smallest}]\!]$
• complexity price of linear blending with $E_{\text{in}}$ (aggregation of best):
  $\ge d_{\text{VC}}\left(\bigcup_{t=1}^{T} \mathcal{H}_t\right)$

like selection, blending practically done with
($E_{\text{val}}$ instead of $E_{\text{in}}$) + ($g_t^-$ from minimum $E_{\text{train}}$)
Any Blending

Given $g_1^-, g_2^-, \ldots, g_T^-$ from $D_{\text{train}}$, transform $(x_n, y_n)$ in $D_{\text{val}}$ to
$(z_n = \Phi^-(x_n), y_n)$, where $\Phi^-(x) = (g_1^-(x), \ldots, g_T^-(x))$

Linear Blending
(1) compute $\alpha = \operatorname{LinearModel}\left(\{(z_n, y_n)\}\right)$
(2) return $G_{\text{LINB}}(x) = \operatorname{LinearHypothesis}_{\alpha}(\Phi(x))$,

Any Blending (Stacking)
(1) compute $\tilde{g} = \operatorname{AnyModel}\left(\{(z_n, y_n)\}\right)$
(2) return $G_{\text{ANYB}}(x) = \tilde{g}(\Phi(x))$,

where $\Phi(x) = (g_1(x), \ldots, g_T(x))$

any blending:
• powerful, achieves conditional blending
• but danger of overfitting, as always :-(
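
The two-step recipe for any blending can be sketched as follows. This is illustrative only: `table_model` is a made-up stand-in for AnyModel (a lookup table over the observed z patterns), and the first-level hypotheses are fixed stumps rather than hypotheses actually trained on $D_{\text{train}}$.

```python
from collections import Counter, defaultdict


def transform(hyps, x):
    """Phi(x) = (g_1(x), ..., g_T(x)) as a hashable tuple."""
    return tuple(g(x) for g in hyps)


def table_model(zy_pairs):
    """Hypothetical AnyModel: majority label per observed z, overall majority otherwise."""
    by_z = defaultdict(Counter)
    overall = Counter()
    for z, y in zy_pairs:
        by_z[z][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    return lambda z: by_z[z].most_common(1)[0][0] if z in by_z else default


def any_blend(hyps, val_data, any_model):
    """Step 1: fit any_model on {(z_n, y_n)}. Step 2: return x -> g_tilde(Phi(x))."""
    g_tilde = any_model([(transform(hyps, x), y) for x, y in val_data])
    return lambda x: g_tilde(transform(hyps, x))


def sign(v):
    return 1 if v >= 0 else -1


# First-level hypotheses (fixed here for illustration) and toy validation data.
hyps = [lambda x: sign(1 - x), lambda x: sign(1 + x), lambda x: -1]
val_data = [(x, 1 if abs(x) < 1 else -1)
            for x in (-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0)]

G_anyb = any_blend(hyps, val_data, table_model)
print(G_anyb(0.2), G_anyb(3.0), G_anyb(-3.0))
```

A lookup table memorizes whatever the validation set contains, which is a small-scale picture of the overfitting danger noted above.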
Blending in Practice

[figure: pipeline diagram with stages labeled Data-Set, Single ..., Val-Set Blending, Test-Set Blending, Post-Processing]
(Chen et al., A linear ensemble of individual and blended models for music rating prediction, 2012)

KDDCup 2011 Track 1: World Champion Solution by NTU

• validation set blending: a special any blending model
  $E_{\text{test}}$ (squared): 519.45 ⟹ 456.24
  helped secure the lead in last two weeks
• test set blending: linear blending using $E_{\text{test}}$
  $E_{\text{test}}$ (squared): 456.24 ⟹ 442.06
  helped turn the tables in last hour

blending 'useful' in practice,
despite the computational burden





Fun Time

Consider three decision stump hypotheses from $\mathbb{R}$ to $\{-1, +1\}$:
$g_1(x) = \operatorname{sign}(1 - x)$, $g_2(x) = \operatorname{sign}(1 + x)$, $g_3(x) = -1$. When $x = 0$,
what is the resulting $\Phi(x) = (g_1(x), g_2(x), g_3(x))$ used in the returned
hypothesis of linear/any blending?

(1) $(+1, +1, +1)$
(2) $(+1, +1, -1)$
(3) $(+1, -1, -1)$
(4) $(-1, -1, -1)$

Reference Answer: (2)

Too easy? :-)
Blending and Bagging Bagging (Bootstrap Aggregation)
What We Have Done
blending: aggregate after getting $g_t$;
learning: aggregate as well as getting $g_t$

aggregation type    blending            learning
uniform             voting/averaging    ?
non-uniform         linear              ?
conditional         stacking            ?

learning $g_t$ for uniform aggregation: diversity important
• diversity by different models: $g_1 \in \mathcal{H}_1, g_2 \in \mathcal{H}_2, \ldots, g_T \in \mathcal{H}_T$
• diversity by different parameters: GD with $\eta = 0.001, 0.01, \ldots, 10$
• diversity by algorithmic randomness:
  random PLA with different random seeds
• diversity by data randomness:
  within-cross-validation hypotheses $g_v^-$

next: diversity by data randomness without $g^-$
Revisit of Bias-Variance

expected performance of A = expected deviation to consensus
                            + performance of consensus
consensus $\bar{g}$ = expected $g_t$ from $D_t \sim P^N$

• consensus more stable than direct $A(D)$,
  but comes from many more $D_t$ than the D on hand
• want: approximate $\bar{g}$ by
  • finite (large) T
  • approximate $g_t = A(D_t)$ from $D_t \sim P^N$ using only D

bootstrapping: a statistical tool that
re-samples from D to 'simulate' $D_t$




Bootstrap Aggregation

bootstrapping
bootstrap sample $\tilde{D}_t$: re-sample N examples from D uniformly with
replacement; can also use arbitrary N' instead of original N

virtual aggregation
consider a virtual iterative process that for t = 1, 2, ..., T
(1) request size-N data $D_t$ from $P^N$ (i.i.d.)
(2) obtain $g_t$ by $A(D_t)$
$G = \operatorname{Uniform}(\{g_t\})$

bootstrap aggregation
consider a physical iterative process that for t = 1, 2, ..., T
(1) request size-N' data $\tilde{D}_t$ from bootstrapping
(2) obtain $g_t$ by $A(\tilde{D}_t)$
$G = \operatorname{Uniform}(\{g_t\})$

bootstrap aggregation (BAGging):
a simple meta algorithm
on top of base algorithm A
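
The physical iterative process can be sketched as below. The base algorithm here is a hypothetical 1-D decision stump learner chosen only to keep the example short (the lecture's demonstration uses pocket), and the data set, seed, and function names are made up for illustration.

```python
import random


def bootstrap_sample(data, rng, size=None):
    """Re-sample `size` examples from data uniformly with replacement."""
    size = len(data) if size is None else size
    return [rng.choice(data) for _ in range(size)]


def stump_learner(sample):
    """Hypothetical base algorithm A: the 1-D stump s * sign(x - theta) with fewest errors."""
    best = None
    for theta in sorted({x for x, _ in sample}):
        for s in (1, -1):
            errors = sum(1 for x, y in sample if (s if x >= theta else -s) != y)
            if best is None or errors < best[0]:
                best = (errors, s, theta)
    _, s, theta = best
    return lambda x: s if x >= theta else -s


def bagging(base_algorithm, data, T, rng):
    """Train one g_t per bootstrap sample and return all of them."""
    return [base_algorithm(bootstrap_sample(data, rng)) for _ in range(T)]


def uniform_vote(hypotheses, x):
    """G(x) = sign(sum_t g_t(x)); ties go to +1 (an arbitrary choice)."""
    return 1 if sum(g(x) for g in hypotheses) >= 0 else -1


rng = random.Random(0)  # fixed seed for repeatability
data = [(i / 10, 1 if i >= 0 else -1) for i in range(-20, 21)]
g_list = bagging(stump_learner, data, T=25, rng=rng)
print(uniform_vote(g_list, -1.5), uniform_vote(g_list, 1.5))
```

Note that `bagging` never looks inside `base_algorithm`, which is what makes it a meta algorithm: any learner that maps a data set to a hypothesis can be plugged in.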
Bagging Pocket in Action

$T_{\text{POCKET}} = 1000$; $T_{\text{BAG}} = 25$

• very diverse $g_t$ from bagging
• proper non-linear boundary after aggregating binary classifiers

bagging works reasonably well if base
algorithm sensitive to data randomness






Fun Time

When using bootstrapping to re-sample N examples $\tilde{D}_t$ from a data set
D with N examples, what is the probability of getting $\tilde{D}_t$ exactly the
same as D?

(1) $0 / N^N = 0$
(2) $1 / N^N$
(3) $N! / N^N$
(4) $N^N / N^N = 1$

Reference Answer: (3)

Consider re-sampling in an ordered manner for
N steps. Then there are $N^N$ possible
outcomes $\tilde{D}_t$, each with equal probability. Most
importantly, $N!$ of the outcomes are
permutations of the original D, and thus the
answer.
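
For small N the counting argument can be checked by brute-force enumeration. As in the reference answer, 'the same as D' is taken to mean that every example is drawn exactly once, in any order. The function name is made up for illustration.

```python
from itertools import product
from math import factorial


def count_same_as_original(n):
    """Enumerate all ordered draws of n indices with replacement.

    Returns (hits, total), where hits counts draws that pick every index once.
    """
    outcomes = list(product(range(n), repeat=n))  # n ** n equally likely outcomes
    hits = sum(1 for o in outcomes if sorted(o) == list(range(n)))
    return hits, len(outcomes)


for n in range(1, 6):
    hits, total = count_same_as_original(n)
    print(n, hits == factorial(n), total == n ** n)
```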
Summary

(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models

Lecture 7: Blending and Bagging

• Motivation of Aggregation
  aggregated G strong and/or moderate
• Uniform Blending
  diverse hypotheses, 'one vote, one value'
• Linear and Any Blending
  two-level learning with hypotheses as transform
• Bagging (Bootstrap Aggregation)
  bootstrapping for diverse hypotheses

• next: getting more diverse hypotheses to make G strong

(3) Distilling Implicit Features: Extraction Models





